SPINDL.REM[UP,DOC]1 - www.SailDart.org

perm filename SPINDL.REM[UP,DOC]1 blob sn#214812 filedate 1976-05-11 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00002 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00002 00002	.BEGIN
C00020 ENDMK
C⊗;
.BEGIN
.REQUIRE "MEMO.PUB[LET,JMC]" SOURCE;
.cb SPINDL - A PROGRAM FOR FILE COMPRESSION

.CB "by Robert E. Maas (REM)"

.CB "(This file is SPINDL.REM[UP,DOC])"



	Disk space in the Lab is  in  short  supply  and  will
remain  that  way  for  the forseeable future.  Therefore, two
methods of reducing the amount of space  taken  by  files  are
hereby  offered.   The first of these allows combining several
files into one, and the second uses Huffman coding to  further
reduce  the  space  taken  to slightly less than half (in most
cases) its original volume.  The two facilities are offered by
the same program called SPINDL.

	Each  disk  file  consumes  at least a full disk block
(2304 words), even if it contains  only  one  word  of  actual
data.     Most  such  files  are  text  files,  and can now be
combined into a "spindle" so that many fewer disk  blocks  are
consumed.   Later, as desired, individual files in the spindle
may be retrieved.  Since the system DIrectory command measures
files in words rather than blocks, SPINDL will help the system
even if the DIrectory  command  shows  no  reduction  of  word
count.

	To use this new facility, type the monitor command

	R SPINDL

The  program  will announce itself and ask you for the name of
the spindle file you want to use, then prompt you with  a  ">"
at  the left margin.   You type a file name, default extension
is ".SPI".   If the file doesn't already  exist,  it  will  be
created.    The  program will now announce the number of files
currently in the spindle, and print a "*" at the  left  margin
to  indicate  that it is waiting for a command.  If you type a
"?" followed by carriage-return, it will  provide  a  list  of
legal  commands.   The following commands will perform most of
the useful functions you will want:

DIrectory -- tells you all files in the spindle.

Spindle -- copies a file from the disk into the spindle.
(Note, the original remains on the disk, so at present the
user must manually delete it after spindling if he is to
reduce disk-space consumption.)

Unspindle -- copies a file from the spindle to the disk.
(Note, original remains in spindle until explicitly deleted.)

DElete -- deletes a file from the spindle (without  reclaiming
space).

Bubble -- copies the entire spindle, deleting all wasted space
such as deleted files.

	After some of  the  commands  (for  example,  Spindle,
Unspindle,  and  DElete)  the program will ask you for various
items, such as the names of the files to use, and  the  prompt
for  typing  each  such item will be a ">" at the left margin.
When it is all done, it will prompt "*"  at  the  left  margin
indicating it is ready for another top-level command.

CRUNCH-AND-SPINDLE:

	Greater  data compression at the cost of more computer
time is obtained with the SPINDL command "Crunch".   (It  uses
Huffman  coding  to reduce the volume of an English or FAIL or
SAIL, etc. file by a further factor of two).  SPINDL  will  ask
you  for  auxiliary  files  called  the  history  tree and the
Huffman tree, but carriage return will use default  trees  (if
the  spindle  already contains a crunched file, default is the
same trees as used the previous time, otherwise default  is  a
particular  pair  of  trees  on [UP,REM] which work quite well
with most English language text such as writeups and  essays).
Regardless  of  whether  the default trees or some other trees
are used, one copy of each tree actualy used  will  be  copied
into  your  spindle  file (if it isn't there already), so that
crunching only achieves a reduction of data if the files
being  crunched with a particular pair of trees total at least
1400-3000 words (assuming a 2-to-1  compression  and  assuming
700-1500  word trees). Naturally, it is also pointless to spindle a
single file without crunching, or to crunch a single file unless its size exceeds
2.3K.
	Appendix  A  explains  how  to  make  special   trees,
assuming none of the already-existing trees listed in Appendix
B are adequate. (Currently only trees for English  exist,  but
soon  an  attempt will be made to create trees for FAIL, SAIL,
LISP, and perhaps LAP and SNOBOL.)

	To uncrunch-and-unspindle, just  use  the  "Unspindle"
command  and  the  right  thing will happen automatically, the
correct trees will be selected from the spindle file according
to their hash number and will used for uncrunching your file.

Timing:
	Spindle -- 3/4 second per thousand words.

	Unspindle -- 1/2 second per thousand words.

	Crunch-by-pages-and-spindle -- 4 sec to  load  default
trees + 2 sec per thousand words.

	Uncrunch-by-pages-and-spindle  -- 3.3 sec to load trees
+ 1 sec per thousand words.

	When  should  a  file  be  SPINDLed?     Our   current
recommendation  is  that  a file not expected to be used in 30
days should be spindled, and if not expected to  be  used  for
another  15  days,  crunched  too.   These recommendations are
conservative, and will change in the direction of more  prompt
spindling when the KL-10 is working.

	Many people will be able to double the effective sizes
of their disk allocations by using these facilities.

.skip
.CB APPENDIX A
.at 8 ⊂ BREAK; ONCE INDENT 8; ⊃;

	At present it is rather a hassle to create new history
and  Huffman  trees  for  encoding  a file, however here is an
explanation of how to do it should you really want to.  If you
already have a list of reversed strings to use, or want to use
one of REM's, skip over steps 1-5 and begin at step 6.  If you
already  have  the  two  Polish  files  you  want  to  use for
crunching, you may begin at step 7.

STEP 1 -- Create a list of strings occurring most often:
	RU CRU1[1,REM] -- Starts up the survey program.
	B -- Sets the program to Bigram/Trigram mode in  which
it scans a file for the strings which occur most often.
	The program will ask you for input file name and write
TOKENS.DAT.

STEP  2  --  Modify  the  list  of strings using your favorite
editor, employing whatever heuristics you have for eliminating
undesired  strings  or  adding  strings  you  believe  will be
useful.

STEP 3 -- Reverse the  list  of  strings  so  that  they  read
backwards  (from  a  given  point  in  a  file back to earlier
characters):
	RU CRU1[1,REM]
	R -- Sets the program to Reverse mode.
	TOKENS.DAT -- Tells it the name of the file to read.
	The program will write SNEKOT.DAT.

STEP 4 -- Separate the three parts of the file and  sort  them
individually into lexicographic sequence:
	ET SNEKOT.DAT/N
	Locate   the  point  where  the  strings  change  from
two-characters to one-character,  and  the  point  where  they
change  from  one-character  to  three-characters.  Put a page
mark at each break.
	E -- Exit the editor.
	DO SNE[1,REM] -- Selects the correct  pages,  creating
SN.1  SN.2 and SN.3 containing strings of a particular length,
assuming your edit was correct.
	R SSORT -- Starts up the String-Sorter program.
	SN.2 -- Sort the 2-letter strings, when it asks you if
it is ok to overwrite, say Y.
	R SSORT
	SN.3
	COPY <FILE>/N/L←SN.1,SN.2,SN.3 -- Whatever <FILE>  you
want,  check  it  to be sure the strings are sorted by length,
and in each group alphanumerically.

STEP 5 --  Modify  the  list  of  reversed  strings  again  if
desired.  Seeing  them  alphabetized  according  to  their new
backwards-format may suggest additional heuristics  for  purge
or addition.

[If  you  jumped  over  steps  1-5,  there is a spindle called
SNEKOT.SPI[1,REM] which contains  various  lists  of  reversed
strings you may want to use.]

STEP  6 -- Scan your collection of files according to the list
of left-contexts defined by  the  reversed  strings  you  have
created or found:
	RU  HOTER[HOT,REM]  or  R  PTYJOB  --  whichever  your
favorite  pty-job  program,  have it output a disk file so you
can save all the TTY output containing the  Huffman  code,  so
that  you  can  examine it sometime later to try to figure out
how to improve it by deleting or inserting  left-contexts.  If
you  are  sure  you  won't want to examine the code, omit this
PTYJOB/HOTER part and just do the following directly  on  your
terminal.
	RU CRU1[1,REM]
	S -- Sets the program to Scan-by-Reversed-Left-Context
mode.
	<FILE>   --   The   list-of-left-contexts   file   you
laboriously created.
	The  program  will  ask you one at a time for files to
survey according to the list of left-contexts that  were  read
in,  compiling  a  complete  '200 word table of how many times
each ASCII character occurs in that  left-context.   When  you
are  finished  with  all the files you want to include, type a
blank line instead of a file name.
	The  program  will  now type out all the totals, which
you want to suppress by control-o or escape-o depending on the
type of terminal you are on.
	After all that, output will be turned back on  by  the
program, it will write out the file HIST.POL which is a Polish
description of all the left-contexts (typically this  file  is
between 50 and 200 words).
	After  that,  the  program  will  begin  computing  an
abbreviated  Huffman  code  for each left-context, and type on
the TTY or PTY the input bits, output bits, compression-ratio,
and  a  complete  description  of  the  Huffman  code  in nice
readable verbose format.  If  you  are  running  this  through
HOTER  you  will  probably  want  to  do a local ↑O command to
suppress output to TTY without affecting the  copying  of  PTY
output  to  the  disk  file.   (PTYJOB  does  not provide this
facility, only HOTER.) This process will probably  take  about
15  minutes  because of the volume of data being inefficiently
copied, all the while  the  file  HUFFS.POL  will  be  written
containing   each  Huffman  code  generated  (typically  about
500-2000 words total).
	You  now  have  HIST.POL  and  HUFFS.POL  containing a
complete description of the code to be used.

[If you jumped directly here, there are several useful history
and  Huffman trees on [UP,REM] with extensions *.HIS and *.HUF
respectively.]

STEP 7 -- When using the  Crunch-by-pages-and-spindle  command
in  SPINDL,  specify  the  files  you created in step 6 or the
files you found already existing *.HIS and *.HUF --  only  one
copy  of  each different tree will be kept in the spindle that
contains files crunched by it.

.skip
.CB APPENDIX B

	All existing trees are  in  the  disk  area  [UP,REM].
History  trees  have  extension  .HIS  and  Huffman trees have
extension .HUF. The following table tells which are  available
and what files they are good for.

.END
.BEGIN NOFILL
.FONT A "GACL25";SELECT A
	*.HIS	*.HUF	type of file it is for
	NOTSEV	NOTIC1	English text (surveyed NOTICE[UP,DOC]) (DEFAULT TREES)
	NOTSEV	SEVEN1	English text (same left-contexts but larger survey)
.END